13 research outputs found

The End of Slow Networks: It's Time for a Redesign

Authors: Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, Erfan Zamanian
Publication date: 19/12/2015

Next-generation high-performance RDMA-capable networks will require a fundamental rethinking of the design and architecture of modern distributed DBMSs. These systems are commonly designed and optimized under the assumption that the network is the bottleneck: the network is slow and "thin", and thus needs to be avoided as much as possible. Yet this assumption no longer holds true. With InfiniBand FDR 4x, the bandwidth available to transfer data across the network is in the same ballpark as the bandwidth of one memory channel, and it increases even further with the most recent EDR standard. Moreover, with the increasing advances of RDMA, latency is improving at a similar pace. In this paper, we first argue that the "old" distributed database design is not capable of taking full advantage of the network. Second, we propose architectural redesigns for OLTP, OLAP and advanced analytical frameworks to take better advantage of the improved bandwidth, latency and RDMA capabilities. Finally, for each of the workload categories, we show that remarkable performance improvements can be achieved.

arXiv.org e-Print Archive

TUbiblio

Tupleware: Redefining Modern Analytics

Authors: Ugur Cetintemel, Andrew Crotty, Kayhan Dursun, Alex Galakatos, Tim Kraska, Stan Zdonik
Publication date: 30/07/2014

There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world: petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users operate clusters ranging from a few to a few dozen nodes, analyze relatively small datasets of up to a few terabytes, and perform primarily compute-intensive operations. Targeting these users fundamentally changes the way we should build analytics systems. This paper describes the design of Tupleware, a new system specifically aimed at the challenges faced by the typical user. Tupleware's architecture brings together ideas from the database, compiler, and programming languages communities to create a powerful end-to-end solution for data analysis. We propose novel techniques that consider the data, computations, and hardware together to achieve maximum performance on a case-by-case basis. Our experimental evaluation quantifies the impact of our novel techniques and shows orders of magnitude performance improvement over alternative systems.

arXiv.org e-Print Archive

CiteSeerX

FITing-Tree: A Data-aware Index Structure

Authors: Carsten Binnig, Rodrigo Fonseca, Alex Galakatos, Tim Kraska, Michael Markovitch
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2019

Index structures are one of the most important tools that DBAs leverage to improve the performance of analytics and transactional workloads. However, building several indexes over large datasets can often become prohibitive and consume valuable system resources. In fact, a recent study showed that indexes created as part of the TPC-C benchmark can account for 55% of the total memory available in a modern DBMS. This overhead consumes valuable and expensive main memory, and limits the amount of space available to store new data or process existing data. In this paper, we present FITing-Tree, a novel form of a learned index which uses piece-wise linear functions with a bounded error specified at construction time. This error knob provides a tunable parameter that allows a DBA to FIT an index to a dataset and workload by being able to balance lookup performance and space consumption. To navigate this tradeoff, we provide a cost model that helps determine an appropriate error parameter given either (1) a lookup latency requirement (e.g., 500ns) or (2) a storage budget (e.g., 100MB). Using a variety of real-world datasets, we show that our index is able to provide performance that is comparable to full index structures while reducing the storage footprint by orders of magnitude.
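To make the idea of "piece-wise linear functions with a bounded error" concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a sorted list of distinct numeric keys and an integer error bound, uses a simple greedy slope-range segmentation chosen here for illustration, and finds the segment with a plain binary search rather than any tree the paper may build on top. The names `build_segments` and `lookup` are hypothetical.

```python
from bisect import bisect_left, bisect_right

def build_segments(keys, error):
    """Greedily cover sorted, distinct keys with linear segments.

    Each segment is (first_key, first_pos, slope). For every key in a segment,
    first_pos + slope * (key - first_key) is within `error` of its true position.
    """
    segments = []
    start = 0
    lo, hi = 0.0, float("inf")  # feasible slope range for the current segment
    for i in range(1, len(keys)):
        dx = keys[i] - keys[start]
        dy = i - start
        new_lo = max(lo, (dy - error) / dx)
        new_hi = min(hi, (dy + error) / dx)
        if new_lo > new_hi:
            # No single slope fits key i as well: close the segment here.
            segments.append((keys[start], start, (lo + hi) / 2))
            start = i
            lo, hi = 0.0, float("inf")
        else:
            lo, hi = new_lo, new_hi
    if keys:
        slope = (lo + hi) / 2 if hi != float("inf") else 0.0
        segments.append((keys[start], start, slope))
    return segments

def lookup(keys, segments, error, key):
    """Return the position of `key` in `keys`, or None if it is absent."""
    s = bisect_right([seg[0] for seg in segments], key) - 1
    if s < 0:
        return None  # key is smaller than every indexed key
    first_key, first_pos, slope = segments[s]
    pred = int(first_pos + slope * (key - first_key))
    # Search only a small window around the prediction (one slot of slack
    # on each side absorbs truncation and floating-point rounding).
    lo_idx = max(0, pred - error - 1)
    hi_idx = min(len(keys), pred + error + 2)
    j = bisect_left(keys, key, lo_idx, hi_idx)
    if j < hi_idx and keys[j] == key:
        return j
    return None
```

In this sketch a larger `error` yields fewer segments (less space) but a wider final search window (slower lookups), which is the tradeoff the abstract describes.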

arXiv.org e-Print Archive

TUbiblio

Crossref

DSpace@MIT

Vizdom: Interactive analytics through pen and touch

Authors: Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, Emanuel Zgraggen
Publication venue: VLDB Endowment
Publication date: 01/01/2015

TUbiblio

The case for interactive data exploration accelerators (IDEAs)

Authors: Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, Emanuel Zgraggen
Publication date: 01/01/2016

TUbiblio

Crossref

Revisiting reuse for approximate query processing

Authors: Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, Emanuel Zgraggen
Publication venue: VLDB Endowment
Publication date: 01/01/2017

TUbiblio

A-Tree: A Bounded Approximate Index Structure

Authors: Carsten Binnig, Rodrigo Fonseca, Alex Galakatos, Tim Kraska, Michael Markovitch
Publication date: 01/01/2018

TUbiblio

How Progressive Visualizations Affect Exploratory Analysis

Authors: Andrew Crotty, Jean-Daniel Fekete, Alex Galakatos, Tim Kraska, Emanuel Zgraggen
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2017

The stated goal for visual data exploration is to operate at a rate that matches the pace of human data analysts, but the ever increasing amount of data has led to a fundamental problem: datasets are often too large to process within interactive time frames. Progressive analytics and visualizations have been proposed as potential solutions to this issue. By processing data incrementally in small chunks, progressive systems provide approximate query answers at interactive speeds that are then refined over time with increasing precision. We study how progressive visualizations affect users in exploratory settings in an experiment where we capture user behavior and knowledge discovery through interaction logs and think-aloud protocols. Our experiment includes three visualization conditions and different simulated dataset sizes. The visualization conditions are: (1) blocking, where results are displayed only after the entire dataset has been processed; (2) instantaneous, a hypothetical condition where results are shown almost immediately; and (3) progressive, where approximate results are displayed quickly and then refined over time. We analyze the data collected in our experiment and observe that users perform equally well with either instantaneous or progressive visualizations in key metrics, such as insight discovery rates and dataset coverage, while blocking visualizations have detrimental effects.
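The progressive condition described above, where data is processed in small chunks and an approximate answer is refined over time, can be illustrated with a short Python sketch. This is an illustrative assumption rather than the system used in the study: `progressive_mean` is a hypothetical helper that yields a running estimate of the mean after each chunk.

```python
from typing import Iterator, Sequence, Tuple

def progressive_mean(values: Sequence[float], chunk_size: int) -> Iterator[Tuple[int, float]]:
    """Yield (rows_seen, running_mean) after each chunk of `values`."""
    total = 0.0
    seen = 0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        total += sum(chunk)
        seen += len(chunk)
        # An approximate answer is available after every chunk and becomes
        # exact once the final chunk has been processed.
        yield seen, total / seen
```

A blocking visualization would correspond to consuming the whole generator and showing only its last value, while a progressive one would redraw after every yielded estimate.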

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL-Rennes 1

An architecture for compiling UDF-centric workflows

Authors: Alex Galakatos, Andrew Crotty, Carsten Binnig, Kayhan Dursun, Stan Zdonik, Tim Kraska, Ugur Cetintemel
Publication date: 01/01/2015

Data analytics has recently grown to include increasingly sophisticated techniques, such as machine learning and advanced statistics. Users frequently express these complex analytics tasks as workflows of user-defined functions (UDFs) that specify each algorithmic step. However, given typical hardware configurations and dataset sizes, the core challenge of complex analytics is no longer sheer data volume but rather the computation itself, and the next generation of analytics frameworks must focus on optimizing for this computation bottleneck. While query compilation has gained widespread popularity as a way to tackle the computation bottleneck for traditional SQL workloads, relatively little work addresses UDF-centric workflows in the domain of complex analytics. In this paper, we describe a novel architecture for automatically compiling workflows of UDFs. We also propose several optimizations that consider properties of the data, UDFs, and hardware together in order to generate different code on a case-by-case basis. To evaluate our approach, we implemented these techniques in TUPLEWARE, a new high-performance distributed analytics system, and our benchmarks show performance improvements of up to three orders of magnitude compared to alternative systems.

CiteSeerX